Abstract

Analyzing massive data sets has been one of the key motivations for studying streaming algorithms. 
In recent years, there has been significant progress in analysing distributions in a streaming setting, but 
the progress on graph problems has been limited. A main reason for this has been the existence of 
linear space lower bounds for even simple problems such as determining the connectedness of a graph. 
However, in many new scenarios that arise from social and other interaction networks, the number of 
vertices is significantly smaller than the number of edges. This has led to the formulation of the
semi-streaming model where we assume that the space is (near) linear in the number of vertices (but not
necessarily the edges), and the edges appear in an arbitrary (and possibly adversarial) order.

However, there has been limited progress in analysing graph algorithms in this model. In this paper we
focus on graph sparsification, which is one of the major building blocks in a variety of graph algorithms.
Further, there has been a long history of (non-streaming) sampling algorithms that provide sparse graph
approximations, and it is a natural question to ask: since the end result of the sparse approximation is a
small (linear) space structure, can we achieve that using a small space, and in addition using a single
pass over the data? The question is interesting from the standpoint of both theory and practice, and we
answer the question in the affirmative by providing a one-pass $\tilde{O}(n/\epsilon^2)$ space algorithm that produces
a sparsification that approximates each cut to a $(1 + \epsilon)$ factor. We also show that $\Omega(n \log \frac{1}{\epsilon})$ space is
necessary for a one-pass streaming algorithm to approximate the min-cut, improving upon the $\Omega(n)$
lower bound that arises from lower bounds for testing connectivity.

1 Introduction 

The feasibility of processing graphs in the data stream model was one of the early questions investigated in
the streaming model. However, the results were not encouraging: even deciding simple properties such
as the connectivity of a graph, when the edges stream in an arbitrary order, required $\Omega(n)$ space. In
comparison to the other results in the streaming model [?], which required polylogarithmic space, graph
algorithms appeared to be difficult in the streaming context and did not receive much attention subsequently.

However, in recent years, with the emergence of social and other interaction networks, questions of
processing massive graphs have once again become prominent. Technologically, since the publication of
[9], it had become feasible to store larger quantities of data in memory, and the semi-streaming model was
proposed in [6, 15]. In this model we assume that the space is (near) linear in the number of vertices (but
not necessarily the edges). Since its formulation, the model has become more appealing from the contexts
of theory as well as practice. From a theoretical viewpoint, the model still offers a rich potential trade-off
between space and accuracy of algorithms, albeit at a different threshold than polylogarithmic space. From
a practical standpoint, in a variety of contexts involving large graphs, such as image segmentation using
graph cuts, the ability of the algorithm to retain the most relevant information in main memory has been
deemed critical. In essence, an algorithm that runs out of main memory space would become unattractive
and infeasible. In such a setting, it may be feasible to represent the vertex set in memory, whereas the
edge set may be significantly larger.

In the semi-streaming model, the first results were provided by [6] on the construction of graph spanners.
Subsequently, beyond explorations of connectivity and (multipass) matching [14], there has been little
development of algorithms in this model. In this paper we focus on the problem of graph sparsification
in a single pass, that is, constructing a small space representation of the graph such that we can estimate
the size of any cut. Graph sparsification [2, 17] remains one of the major building blocks for a variety
of graph algorithms, such as flows and disjoint paths. At the same time, sparsification immediately
provides a way of finding an approximate min-cut in a graph. The problem of finding a min-cut in a graph
has been one of the more celebrated problems and there is a vast literature on this problem, including
both deterministic [?, 8] as well as randomized algorithms [10, 11, 13, 12]; see [?] for a comprehensive
discussion of various algorithms. We believe that a result on sparsification will enable the investigation of a
richer class of problems in graphs in the semi-streaming model.

In this paper we will focus exclusively on the model in which the stream is adversarially ordered and a single
pass is allowed. From the standpoint of techniques, our algorithm is similar in spirit to the algorithm of
Alon-Matias-Szegedy [1], where we simultaneously sample and estimate from the stream. In fact we show that in
the semi-streaming model we can perform a similar, but non-trivial, simultaneous sampling and estimation.
This is pertinent because sampling algorithms for sparsification exist [2, 17], which use $O(n\,\mathrm{polylog}(n))$
edges. However, these algorithms sample edges in an iterative fashion that requires the edges to be present
in memory, with random access to them.

Our Results: Our approach is to recursively maintain a summary of the graph seen so far and use that
summary itself to decide on the action to be taken on seeing a new edge. To this end, we modify the
sparsification algorithm of Benczur and Karger [2] for the semi-streaming model. The final algorithm uses
a single pass over the edges, provides a $1 \pm \epsilon$ approximation for cut values with high probability, and uses
$O(n(\log n + \log m)(\log \frac{m}{n})(1 + \epsilon)^2/\epsilon^2)$ edges for an $n$-node, $m$-edge graph.

2 Background and Notation 

Let $G$ denote the input graph and let $n$ and $m$ respectively denote the number of nodes and edges. $VAL(C, G)$
denotes the value of cut $C$ in $G$. $w_G(e)$ indicates the weight of $e$ in graph $G$.

Definition 1 [2] A graph is $k$-strong connected if and only if every cut in the graph has value at least
$k$. A $k$-strong connected component is a maximal node-induced subgraph which is $k$-strong connected. The
strong connectivity of an edge $e$ is the maximum $k$ such that there exists a $k$-strong connected component
that contains $e$.

In [2], they compute the strong connectivity of each edge and use it to decide the sampling probability.
Algorithm 1 is their algorithm. We will modify this in Section 3.
Benczur-Karger ([2])
Data: Graph $G = (V, E)$
Result: Sparsified graph $H$

compute the strong connectivity $c_e^{(G)}$ of edge $e$ for all $e \in G$;
$H \leftarrow (V, \emptyset)$;
foreach $e$ do
    $p_e = \min\{\rho/c_e, 1\}$;
    with probability $p_e$, add $e$ to $H$ with weight $1/p_e$;
end

Algorithm 1: Sparsification Algorithm

Here $\rho$ is a parameter that depends on the size of $G$ and the error bound $\epsilon$. They proved the following
two theorems in their paper.

Theorem 2.1 [2] Given $\epsilon$ and a corresponding $\rho = 16(d + 2)(\ln n)/\epsilon^2$, every cut in $H$ has value between
$(1 - \epsilon)$ and $(1 + \epsilon)$ times its value in $G$ with probability $1 - n^{-d}$.

Theorem 2.2 [2] With high probability $H$ has $O(n\rho)$ edges.
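The sampling loop of Algorithm 1 can be illustrated with a minimal Python sketch. It assumes unit edge weights and that the strong connectivities are already available in a dictionary (`strength` is a hypothetical input here; computing it is the expensive part and is not shown). The function names `bk_rho` and `bk_sample` are illustrative only.

```python
import math
import random

def bk_rho(n, eps, d=1.0):
    """Parameter of Theorem 2.1: rho = 16 (d + 2) ln(n) / eps^2."""
    return 16.0 * (d + 2.0) * math.log(n) / (eps * eps)

def bk_sample(edges, strength, rho, rng=random):
    """Sampling loop of Algorithm 1 (unit edge weights assumed).

    edges    : iterable of (u, v) pairs
    strength : dict edge -> strong connectivity c_e, assumed precomputed
    Returns a dict {edge: weight} describing the sparsified graph H.
    """
    H = {}
    for e in edges:
        p_e = min(rho / strength[e], 1.0)
        if rng.random() < p_e:      # keep e with probability p_e
            H[e] = 1.0 / p_e        # reweight so the expected weight stays 1
    return H
```

When $\rho$ is at least the largest strong connectivity, every edge is kept with weight 1, so the sketch only compresses graphs whose strengths exceed $\rho$.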

Throughout this paper, $e_1, e_2, \ldots, e_m$ denotes the input sequence. $G_i$ is a graph that consists of $e_1, e_2, \ldots, e_i$.
$c_e^{(G)}$ is the strong connectivity of $e$ in $G$ and $w_G(e)$ is the weight of an edge $e$ in $G$.
$G_{i,j} = \{e : e \in G_i,\ 2^{j-1} \le c_e^{(G_i)} < 2^j\}$. Each edge has weight 1 in $G_{i,j}$.
$F_{i,j} = \sum_{k \ge j} 2^{j-k} G_{i,k}$, where scalar multiplication of a
graph and addition of graphs are defined as scalar multiplication and addition of edge weights. In addition,
$H \in (1 \pm \epsilon)G$ if and only if $(1 - \epsilon)VAL(C, G) \le VAL(C, H) \le (1 + \epsilon)VAL(C, G)$ for every cut $C$. $H_i$ is a sparsification
of the graph $G_i$, i.e., the sparsified graph after considering $e_i$ in the streaming model.
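The relation $H \in (1 \pm \epsilon)G$ can be made concrete with a brute-force check over all cuts. The sketch below is only meant for very small graphs (it enumerates exponentially many cuts). The dictionary representation of weighted graphs and the names `cut_value` and `is_cut_sparsifier` are assumptions made for illustration.

```python
from itertools import combinations

def cut_value(weighted_edges, side):
    """VAL(C, .): total weight of edges with exactly one endpoint in `side`."""
    return sum(w for (u, v), w in weighted_edges.items()
               if (u in side) != (v in side))

def is_cut_sparsifier(nodes, G, H, eps):
    """Check (1 - eps) VAL(C, G) <= VAL(C, H) <= (1 + eps) VAL(C, G) for every cut C."""
    nodes = list(nodes)
    first, rest = nodes[0], nodes[1:]
    # Each cut is enumerated once, via the side that contains `first`;
    # r stops before len(rest) so the full vertex set is never used.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            side = {first, *extra}
            g, h = cut_value(G, side), cut_value(H, side)
            if not ((1 - eps) * g <= h <= (1 + eps) * g):
                return False
    return True
```

For example, on a unit-weight triangle, dropping one edge and doubling another fails the test for $\epsilon = 0.1$ but passes for $\epsilon = 0.5$.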

3 A Semi-Streaming Algorithm 

We cannot use Algorithm 1 in the streaming model since it is not possible to compute the strong connectivity
of an edge in $G$ without storing all the data. The overall idea is to use a strongly recursive process,
where we use an estimate of the connectivity based on the current sparsification and show that subsequent
addition of edges does not impact the process. The modification is not difficult to state, which makes us
believe that such a modification is likely to find use in practice. The nontrivial part of the algorithm is in the
analysis, ensuring that the various dependencies being built into the process do not create a problem. For
completeness the modifications are presented in Algorithm 2.

Stream-Sparsification
Data: The sequence of edges $e_1, e_2, \ldots, e_m$
Result: Sparsified graph $H$

$H \leftarrow \emptyset$;
foreach $e$ do
    compute the connectivity $c_e$ of $e$ in $H$;
    $p_e = \min\{\rho/c_e, 1\}$;
    add $e$ to $H$ with probability $p_e$ and weight $1/p_e$;
end

Algorithm 2: Streaming Sparsification Algorithm

We use $\rho = 32((4 + d) \ln n + \ln m)(1 + \epsilon)/\epsilon^2$ given $\epsilon > 0$; once again $d$ is a constant which determines the
probability of success. We prove two theorems for Algorithm 2. The first theorem is about the approximation
ratio and the second theorem is about its space requirement. For simplicity of the proofs, we only consider
sufficiently small $\epsilon$.
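The streaming loop can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the connectivity computation is passed in as a caller-supplied function (`connectivity` is a hypothetical callback that must return a positive value), edges arrive with unit weight, and repeated (parallel) edges are merged by adding their weights.

```python
import math
import random

def stream_rho(n, m, eps, d=1.0):
    """rho = 32 ((4 + d) ln n + ln m) (1 + eps) / eps^2, as used for Algorithm 2."""
    return 32.0 * ((4.0 + d) * math.log(n) + math.log(m)) * (1.0 + eps) / (eps * eps)

def stream_sparsify(edge_stream, rho, connectivity, rng=random):
    """Single pass over the edges; only the sparsifier H is kept in memory.

    connectivity(H, e) is a hypothetical, caller-supplied routine returning
    the connectivity c_e > 0 of the arriving edge e with respect to the current H.
    """
    H = {}  # edge -> weight
    for e in edge_stream:
        c_e = connectivity(H, e)
        p_e = min(rho / c_e, 1.0)
        if rng.random() < p_e:
            # Assumption of this sketch: parallel edges accumulate weight.
            H[e] = H.get(e, 0.0) + 1.0 / p_e
    return H
```

A usage example with a trivial stand-in estimator: `stream_sparsify(edges, stream_rho(n, m, 0.5), lambda H, e: 1.0)` keeps every edge with weight 1, since $\rho/c_e \ge 1$. A real estimator would compute the connectivity of $e$ in the weighted graph $H$.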
Theorem 3.1 Given $\epsilon > 0$, $H$ is a sparsification, that is $H \in (1 \pm \epsilon)G$, with probability $1 - O(1/n^d)$.

Theorem 3.2 If $H \in (1 \pm \epsilon)G$, $H$ has $O(n(d \log n + \log m)(\log m - \log n)(1 + \epsilon)^2/\epsilon^2)$ edges.

We use a sequence of ideas similar to that in Benczur and Karger [2]. Let us first discuss the proof in [2].

In that paper, Theorem 2.1 is proved in three steps. First, the result of Karger [11] on uniform sampling
is used. This presents two problems. The first is that they need to know the value of the minimum cut to get
a constant error bound. The other is that the number of edges sampled is too large. In the worst case, uniform
sampling gains only a constant factor reduction in the number of edges.

To solve this problem, Benczur and Karger [2] decompose a graph into $k$-strong connected components.
In a $k$-strong connected component, the minimum cut is at least $k$ while the maximum number of edges in a
$k$-strong connected component (without a $(k + 1)$-strong connected component as its subgraph) is at most $kn$.
They use uniform sampling within each component, with different sampling rates for different components.
In this way, they guarantee the error bound for every cut.

We cannot use Karger's result [11] directly to prove our sparsification algorithm because the probability
of sampling an edge depends on the sampling results of previous edges. We show the error bound of
a single cut by a suitable bound on a martingale process. Using that, we prove that if we do not make an
error until the $i$-th edge, we guarantee the same error bound for every cut after sampling the $(i + 1)$-th edge with high
probability. Using the union bound, we prove that our sparsification is good with high probability.



4 Proof of Theorem 3.1 



4.1 Single Cut 

We prove Theorem 3.1 first. First, we prove the error bound of a single cut in Lemma 4.1. The proof will
be similar to that of the Chernoff bound [4]. $\rho$ in Lemma 4.1 is a parameter and we use different $\rho$ for different
strong connected components in the later proof.

Lemma 4.1 Let $C = \{e_{i_1}, e_{i_2}, \ldots, e_{i_l}\}$ with $i_1 < i_2 < \cdots < i_l$ be a cut in a graph $G$ such that $w_G(e_{i_j}) \le 1$
and $VAL(C, G) = c$. The index of the edges corresponds to the arrival order of the edges in the data
stream. Let $A_C$ be an event such that $p_e \ge \rho$ for all $e \in C$. Let $H$ be a sparsification of $G$. Then,
$\mathbb{P}[A_C \wedge (|VAL(C, H) - c| > \beta c)] < 2\exp(-\beta^2 \rho c/4)$ for any $0 < \beta \le 2e - 1$.



Let $X_j = \rho\, w_H(e_{i_j})$ and $\mu_j = \mathbb{E}[X_j] = \rho\, w_G(e_{i_j})$. Then, $|VAL(C, H) - c| > \beta c$ if and only if
$|\sum_j X_j - \rho c| > \beta \rho c$. As already mentioned, we cannot apply the Chernoff bound because there are two
problems:

1. the $X_j$ are not independent from each other and

2. the values of $X_j$ are not bounded.

The second problem is easy to solve because we have $A_C$. Let $Y_j$ be random variables defined as follows:

$$
Y_j = \begin{cases} X_j & \text{if } p_{e_{i_j}} \ge \rho, \\ \mu_j & \text{otherwise.} \end{cases}
$$
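If $A_C$ happens, $Y_j = X_j$. Thus,

$$
\begin{aligned}
\mathbb{P}[A_C \wedge (|VAL(C, H) - c| > \beta c)]
  &= \mathbb{P}\Bigl[A_C \wedge \Bigl(\Bigl|\sum_j X_j - \sum_j \mu_j\Bigr| > \beta \rho c\Bigr)\Bigr] \\
  &= \mathbb{P}\Bigl[A_C \wedge \Bigl(\Bigl|\sum_j Y_j - \sum_j \mu_j\Bigr| > \beta \rho c\Bigr)\Bigr] \\
  &\le \mathbb{P}\Bigl[\Bigl|\sum_j Y_j - \sum_j \mu_j\Bigr| > \beta \rho c\Bigr] \qquad (1)
\end{aligned}
$$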

The proof of the bound on (1) is similar to the Chernoff bound [4]. However, since we do not have independent Bernoulli
random variables, we need to prove the upper bound of $\mathbb{E}[\exp(t \sum_j Y_j)]$ given $t$. We start with $\mathbb{E}[\exp(tY_j)]$.

Lemma 4.2 $\mathbb{E}[\exp(tY_j) \mid H_{i_j - 1}] \le \exp(\mu_j(e^t - 1))$ for any $t$ and $H_{i_j - 1}$.

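Proof: There are two cases. Given $H_{i_j - 1}$, either $p_{e_{i_j}} \ge \rho$ or $p_{e_{i_j}} < \rho$. At the end of each case, we use the fact
that $1 + x \le e^x$.

Case 1: If $p_{e_{i_j}} < \rho$, $Y_j = \mu_j$.

$$
\mathbb{E}[\exp(tY_j) \mid H_{i_j - 1}] = \exp(t\mu_j) \le \exp(\mu_j(e^t - 1)).
$$

Case 2: If $p_{e_{i_j}} \ge \rho$, $Y_j = X_j$. So $\mathbb{E}[\exp(tY_j) \mid H_{i_j - 1}] = p_{e_{i_j}} \exp(t\mu_j/p_{e_{i_j}}) + (1 - p_{e_{i_j}})$. Let
$f(x) = x \exp(t\mu_j/x) + (1 - x)$. Observe that $f'(x) \le 0$ for $x > 0$. So $f(x)$ is a decreasing function. Also
we have $\mu_j = \rho\, w_G(e_{i_j}) \le \rho \le p_{e_{i_j}}$ since $w_G(e_{i_j}) \le 1$. Hence,

$$
p_{e_{i_j}} \exp(t\mu_j/p_{e_{i_j}}) + (1 - p_{e_{i_j}}) \le \mu_j \exp(t) + (1 - \mu_j).
$$

Therefore,

$$
\mathbb{E}[\exp(tY_j) \mid H_{i_j - 1}] \le \mu_j(\exp(t) - 1) + 1 \le \exp(\mu_j(e^t - 1)).
$$

From cases 1 and 2, $\mathbb{E}[\exp(tY_j) \mid H_{i_j - 1}] \le \exp(\mu_j(e^t - 1))$ for any $H_{i_j - 1}$. □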
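The monotonicity of $f$ used in Case 2 can be verified directly. The short derivation below is an added check, writing $y = t\mu_j/x$ (a substitution introduced here only for convenience) and using the same inequality $1 + x \le e^x$ with $x = -y$.

```latex
\begin{align*}
f(x)  &= x\,e^{t\mu_j/x} + (1 - x), \\
f'(x) &= e^{t\mu_j/x} - \frac{t\mu_j}{x}\,e^{t\mu_j/x} - 1
       = e^{y}(1 - y) - 1, \qquad y = \frac{t\mu_j}{x}, \\
1 - y &\le e^{-y} \;\Longrightarrow\; e^{y}(1 - y) \le 1 \;\Longrightarrow\; f'(x) \le 0 .
\end{align*}
```

Since $\mu_j \le p_{e_{i_j}}$, a decreasing $f$ gives $f(p_{e_{i_j}}) \le f(\mu_j) = \mu_j e^{t} + (1 - \mu_j)$, which is the inequality used in the proof.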
Now, we prove the upper bound of $\mathbb{E}[\exp(t \sum_j Y_j)]$.

Lemma 4.3 Let $S_j = \sum_{k=j}^{l} Y_k$. For any $t$ and $H_{i_j - 1}$, $\mathbb{E}[\exp(tS_j) \mid H_{i_j - 1}] \le \exp(\sum_{k=j}^{l} \mu_k(e^t - 1))$.

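Proof: We prove by induction. For $j = l$, $\mathbb{E}[\exp(tS_l) \mid H_{i_l - 1}] = \mathbb{E}[\exp(tY_l) \mid H_{i_l - 1}] \le \exp(\mu_l(e^t - 1))$
by Lemma 4.2.

Assume that $\mathbb{E}[\exp(tS_{j+1}) \mid H_{i_{j+1} - 1}] \le \exp(\sum_{k=j+1}^{l} \mu_k(e^t - 1))$ for any $H_{i_{j+1} - 1}$. Then,

$$
\begin{aligned}
\mathbb{E}[\exp(tS_j) \mid H_{i_j - 1}]
  &= \mathbb{E}\Bigl[\exp(tY_j)\, \mathbb{E}[\exp(tS_{j+1}) \mid H_{i_{j+1} - 1}] \Bigm| H_{i_j - 1}\Bigr] \\
  &\le \mathbb{E}\Bigl[\exp(tY_j) \exp\Bigl(\sum_{k=j+1}^{l} \mu_k(e^t - 1)\Bigr) \Bigm| H_{i_j - 1}\Bigr] \\
  &= \exp\Bigl(\sum_{k=j+1}^{l} \mu_k(e^t - 1)\Bigr)\, \mathbb{E}[\exp(tY_j) \mid H_{i_j - 1}] \\
  &\le \exp\Bigl(\sum_{k=j}^{l} \mu_k(e^t - 1)\Bigr).
\end{aligned}
$$

Therefore, $\mathbb{E}[\exp(tS_j) \mid H_{i_j - 1}] \le \exp(\sum_{k=j}^{l} \mu_k(e^t - 1))$ for any $H_{i_j - 1}$ and $t$. □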

Now we prove Lemma 4.1. Remember that we only need to prove $\mathbb{P}[|\sum_j Y_j - \rho c| > \beta \rho c] < 2\exp(-\beta^2 \rho c/4)$
by (1).

Proof: [Proof of Lemma 4.1] Let $S = S_1 = \sum_j Y_j$ and $\mu = \sum_j \mu_j = \rho c$. We prove in two parts:
$\mathbb{P}[S > (1 + \beta)\mu] \le \exp(-\beta^2 \mu/4)$ and $\mathbb{P}[S < (1 - \beta)\mu] \le \exp(-\beta^2 \mu/4)$.

We prove $\mathbb{P}[S > (1 + \beta)\mu] \le \exp(-\beta^2 \mu/4)$ first. By applying Markov's inequality to $\exp(tS)$ for any
$t > 0$, we obtain

$$
\mathbb{P}[S > (1 + \beta)\mu] < \frac{\mathbb{E}[\exp(tS)]}{\exp(t(1 + \beta)\mu)} \le \frac{\exp(\mu(e^t - 1))}{\exp(t(1 + \beta)\mu)}.
$$

The second inequality is from Lemma 4.3. From this point, the proof is identical to that of the Chernoff bound [4], which
gives us the bound $\exp(-\beta^2 \mu/4)$ for $\beta \le 2e - 1$. To prove that $\mathbb{P}[S < (1 - \beta)\mu] \le \exp(-\beta^2 \rho c/4)$ we apply
Markov's inequality to $\exp(-tS)$ for any $t > 0$, and proceed as above. Applying the union bound to these
two bounds, we obtain a bound of $2\exp(-\beta^2 \mu/4)$. □
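For reference, the step borrowed from the Chernoff bound can be spelled out for the upper tail. The derivation below is an added sketch, restricted to the range $0 < \beta \le 1$, where the final inequality is the standard textbook one. This range covers the later use of the lemma with $\beta = \epsilon/2$ and small $\epsilon$.

```latex
\begin{align*}
\mathbb{P}[S > (1+\beta)\mu]
  &\le \min_{t > 0} \exp\bigl(\mu(e^{t} - 1) - t(1+\beta)\mu\bigr) \\
  &= \left(\frac{e^{\beta}}{(1+\beta)^{1+\beta}}\right)^{\mu}
     \qquad \text{at } t = \ln(1+\beta), \\
  &\le \exp(-\beta^{2}\mu/3) \le \exp(-\beta^{2}\mu/4)
     \qquad \text{for } 0 < \beta \le 1 .
\end{align*}
```

The minimizer $t = \ln(1 + \beta)$ comes from setting the derivative of the exponent, $\mu e^{t} - (1 + \beta)\mu$, to zero.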



4.2 k-strong Connected Component

Now we prove the following lemma given a $k$-strong connected component and parameter $\rho$. This corresponds
to the proof of the uniform sampling method in [11].



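Lemma 4.4 Let $Q$ be a $k$-strong component such that each edge has weight at most 1. $H_Q$ is its sparsified
graph. Let $\beta = \sqrt{4((4 + d) \ln n + \ln m)/\rho k}$ for some constant $d > 0$. Suppose that $A_Q$ is the event
that every edge in $Q$ is sampled with probability at least $\rho$. Then, $\mathbb{P}[A_Q \wedge (H_Q \notin (1 \pm \beta)Q)] = O(1/n^{2+d} m)$.

Proof: Consider a cut $C$ whose value is $\alpha k$ in $Q$. If $A_Q$ holds, every edge in $C$ is also sampled with
probability at least $\rho$. By Lemma 4.1, $\mathbb{P}[A_Q \wedge |VAL(C, H_Q) - \alpha k| > \beta \alpha k] \le 2\exp(-\beta^2 \rho \alpha k/4) = 2n^{-\alpha(4+d)} m^{-\alpha}$.
Let $P(\alpha) = 2n^{-\alpha(4+d)} m^{-\alpha}$.
Let $F(\alpha)$ be the number of cuts with value less than or equal to $\alpha k$. By the union bound, we have

$$
\mathbb{P}[A_Q \wedge (H_Q \notin (1 \pm \beta)Q)] \le P(1)F(1) + \int_1^{\infty} P(\alpha) \frac{dF}{d\alpha}\, d\alpha.
$$

The number of cuts whose value is at most $\alpha$ times the minimum cut is at most $n^{2\alpha}$. Since the value of the minimum
cut of $Q$ is at least $k$, $F(\alpha) \le n^{2\alpha}$. Since $P$ is a monotonically decreasing function, this bound is maximized when
$F(\alpha) = n^{2\alpha}$. Thus,

$$
\begin{aligned}
\mathbb{P}[A_Q \wedge (H_Q \notin (1 \pm \beta)Q)]
  &\le F(1)P(1) + \int_1^{\infty} P(\alpha) \frac{dF}{d\alpha}\, d\alpha \\
  &\le n^2 P(1) + \int_1^{\infty} P(\alpha)\,(2n^{2\alpha} \ln n)\, d\alpha \\
  &= \frac{2}{n^{2+d} m} + \int_1^{\infty} \frac{4 \ln n}{n^{\alpha(2+d)} m^{\alpha}}\, d\alpha \\
  &= O(1/n^{2+d} m).
\end{aligned}
$$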



□ 
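The last step of the proof can be checked by evaluating the integral in closed form. The computation below is an added verification; it assumes $P(\alpha) = 2n^{-\alpha(4+d)} m^{-\alpha}$, the per-cut bound obtained from Lemma 4.1 with the stated $\beta$, and uses the shorthand $x = n^{2+d} m$ (introduced here only for readability).

```latex
\begin{align*}
P(1)\,n^{2} &= 2n^{-(4+d)}m^{-1}\cdot n^{2} = \frac{2}{x}, \\
\int_{1}^{\infty} P(\alpha)\,\bigl(2n^{2\alpha}\ln n\bigr)\,d\alpha
  &= 4\ln n \int_{1}^{\infty} x^{-\alpha}\,d\alpha
   = \frac{4\ln n}{x\ln x} \le \frac{4}{(2+d)\,x} .
\end{align*}
```

The final inequality uses $\ln x = (2 + d)\ln n + \ln m \ge (2 + d)\ln n$, so both terms are $O(1/n^{2+d} m)$.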



4.3 Error Bound for Hi and H 

Lemma 4.5 The probability of $i$ being the first integer such that $H_i \notin (1 \pm \epsilon)G_i$ is $O(1/n^d m)$.

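Proof: If $H_j \in (1 \pm \epsilon)G_j$ for all $j < i$, then $c_{e_i}^{(H_{i-1})} \le (1 + \epsilon)c_{e_i}^{(G_{i-1})} \le (1 + \epsilon)c_{e_i}^{(G_i)}$. Remember that $c_e^{(G)}$ denotes
the strong connectivity of $e$ in graph $G$.

$$
H_i = \sum_{j=-\infty}^{\infty} H_{i,j}
$$

$H_{i,j} + (1/2)F_{i,j+1}$ is a sparsification of $G_{i,j} + (1/2)F_{i,j+1} = F_{i,j}$. $F_{i,j}$ consists of $2^{j-1}$-strong connected
components. For every $e \in G_{i,j}$, $c_e^{(G_i)} < 2^j$. So it is sampled with probability at least $\rho' = \rho/((1 + \epsilon)2^j)$. If we
consider one $2^{j-1}$-strong connected component and set $\rho = 32((4 + d) \ln n + \ln m)(1 + \epsilon)/\epsilon^2$, by Lemma
4.4 every cut has error bound $\epsilon/2$ with probability at least $1 - O(1/n^{2+d} m)$. Since there are fewer than $n^2$
such distinct strong connected components, with probability at least $1 - O(1/n^d m)$, $H_{i,j} + (1/2)F_{i,j+1} \in
(1 \pm \epsilon/2)F_{i,j}$ for every $j$. Hence,

$$
\begin{aligned}
H_i &\in (1 \pm \epsilon/2) \sum_{j=-\infty}^{\infty} F_{i,j} - \frac{1}{2} \sum_{j=-\infty}^{\infty} F_{i,j+1} \\
    &\subseteq (2 \pm \epsilon)G_i - G_i \\
    &= (1 \pm \epsilon)G_i.
\end{aligned}
$$

Therefore, $\mathbb{P}[(\forall j < i.\ H_j \in (1 \pm \epsilon)G_j) \wedge (H_i \notin (1 \pm \epsilon)G_i)] = O(1/n^d m)$. □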
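Two computations are implicit in this proof and can be written out. The block below is an added check. It assumes the sampling probability $\rho' = \rho/((1+\epsilon)2^{j})$ and component strength $k = 2^{j-1}$ are the values plugged into Lemma 4.4 (the symbol $\rho'$ is used here only to keep the two parameters apart), and that $F_{i,j} = \sum_{k \ge j} 2^{j-k} G_{i,k}$.

```latex
\begin{align*}
\beta^{2} &= \frac{4\bigl((4+d)\ln n + \ln m\bigr)}{\rho' k}
           = \frac{4\bigl((4+d)\ln n + \ln m\bigr)(1+\epsilon)\,2^{j}}{\rho\,2^{j-1}}
           = \frac{8\,\epsilon^{2}}{32} = \frac{\epsilon^{2}}{4}, \\
\sum_{j=-\infty}^{\infty} F_{i,j}
          &= \sum_{j}\sum_{k \ge j} 2^{j-k} G_{i,k}
           = \sum_{k} G_{i,k} \sum_{j \le k} 2^{j-k}
           = 2\sum_{k} G_{i,k} = 2G_i .
\end{align*}
```

The first line gives the error bound $\beta = \epsilon/2$; the second explains why $(1 \pm \epsilon/2)\sum_j F_{i,j}$ equals $(2 \pm \epsilon)G_i$ and $\frac{1}{2}\sum_j F_{i,j+1}$ equals $G_i$.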
From Lemma 4.5, Theorem 3.1 is obvious. $\mathbb{P}[H \notin (1 \pm \epsilon)G] \le \sum_{i=1}^{m} \mathbb{P}[(\forall j < i.\ H_j \in (1 \pm \epsilon)G_j) \wedge
(H_i \notin (1 \pm \epsilon)G_i)] = O(1/n^d)$.
5 Proof of Theorem 3.2

For the proof of Theorem 3.2, we use the following property of strong connectivity.

Lemma 5.1 [2] If the total edge weight of graph $G$ is $n(k - 1)$ or higher, there exists a $k$-strong connected
component.

Lemma 5.2 If $H \in (1 \pm \epsilon)G$, the total edge weight of $H$ is at most $(1 + \epsilon)m$.

Proof: Let $C_v$ be the cut $(\{v\}, V - \{v\})$. Since $H \in (1 \pm \epsilon)G$, $VAL(C_v, H) \le (1 + \epsilon)VAL(C_v, G)$.
The total edge weight of $H$ is $(\sum_{v \in V} VAL(C_v, H))/2$ since each edge is counted for two such cuts. Similarly,
$G$ has $(\sum_{v \in V} VAL(C_v, G))/2 = m$ edges. Therefore, if $H \in (1 \pm \epsilon)G$, the total edge weight of $H$ is at most
$(1 + \epsilon)m$. □
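The counting identity behind this proof (each edge is counted in exactly two singleton cuts) is easy to check numerically. The helper below is a minimal sketch; the dictionary representation of a weighted graph and the function name are assumptions made for illustration.

```python
def total_weight_via_vertex_cuts(nodes, weighted_edges):
    """Half the sum of the singleton-cut values VAL(C_v, .) over all vertices v."""
    total = 0.0
    for v in nodes:
        # An edge crosses the cut ({v}, V - {v}) iff exactly one endpoint is v.
        total += sum(w for (a, b), w in weighted_edges.items()
                     if (a == v) != (b == v))
    return total / 2.0
```

For a graph without self-loops the result equals the plain sum of the edge weights, which is the quantity bounded by $(1 + \epsilon)m$ in the lemma.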

Let $E_k = \{e : e \in H \text{ and } c_e \le k\}$. $E_k$ is the set of edges that were sampled with $c_e \le k$. We want to bound
the total weight of edges in $E_k$.

Lemma 5.3 $\sum_{e \in E_k} w_H(e) \le n(k + k/\rho)$.

Proof: Let $H'$ be the subgraph of $H$ that consists of the edges in $E_k$. $H'$ does not have a $(k + k/\rho + 1)$-strong
connected component. Suppose that it has. Then there exists a first edge $e$ that creates a $(k + k/\rho + 1)$-strong
connected component in $H'$. In that case, $e$ must be in the $(k + k/\rho + 1)$-strong connected component.
However, since the weight of $e$ is at most $k/\rho$, that component is at least $(k + 1)$-strong connected without $e$. This
contradicts $c_e \le k$. Therefore, $H'$ does not have any $(k + k/\rho + 1)$-strong connected component. By
Lemma 5.1, $\sum_{e \in E_k} w_H(e) \le n(k + k/\rho)$. □

Now we prove Theorem 3.2.

Proof: [Proof of Theorem 3.2] If the total edge weight is the same, the number of edges is maximized
when we sample edges with the smallest strong connectivity. So in the worst case,

$$
\sum_{e \in E_k - E_{k-1}} w_H(e) = nk(1 + 1/\rho) - n(k - 1)(1 + 1/\rho) = n(1 + 1/\rho).
$$

In that case, $k$ is at most $(1 + \epsilon)m/(n(1 + 1/\rho))$. Let this value be $k_m$. Then, the total number of edges in $H$ is

$$
\begin{aligned}
\sum_{i=1}^{k_m} \frac{n(1 + 1/\rho)}{i/\rho}
  &= O(n(\rho + 1) \log(k_m)) \\
  &= O(n\rho(\log m - \log n)) \\
  &= O(n(d \log n + \log m)(\log m - \log n)(1 + \epsilon)^2/\epsilon^2).
\end{aligned}
$$

□ 
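The worst-case count in this proof is a harmonic sum, since $n(1 + 1/\rho)/(i/\rho) = n(\rho + 1)/i$. The sketch below evaluates it numerically. Rounding $k_m$ down to an integer (and forcing it to be at least 1) is an assumption of this illustration, and the function name is hypothetical.

```python
import math

def worst_case_edge_count(n, m, rho, eps):
    """Evaluate sum_{i=1}^{k_m} n (1 + 1/rho) / (i / rho),
    with k_m = (1 + eps) m / (n (1 + 1/rho)) rounded down (assumption)."""
    k_m = max(1, math.floor((1.0 + eps) * m / (n * (1.0 + 1.0 / rho))))
    return sum(n * (1.0 + 1.0 / rho) / (i / rho) for i in range(1, k_m + 1))
```

Because the harmonic number $H_{k_m}$ is at most $1 + \ln k_m$, the returned value is at most $n(\rho + 1)(1 + \ln k_m)$, matching the $O(n(\rho + 1)\log k_m)$ step.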

6 Space Lower bounds 

First, we prove a simple space lower bound for weighted graphs, where the lower bound depends on $\epsilon$.

Theorem 6.1 For $0 < \epsilon < 1$, $\Omega(n(\log C + \log \frac{1}{\epsilon}))$ bits are required in order to sparsify every cut of a
weighted graph within a $(1 \pm \epsilon)$ factor, where $C$ is the maximum edge weight and 1 is the minimum edge weight.



Proof: Let $F$ be a set of graphs such that there is a center node $u$ and other nodes are connected to $u$ by
an edge whose weight is one of $1, \frac{1+\epsilon}{1-\epsilon}, (\frac{1+\epsilon}{1-\epsilon})^2, \ldots, C$. Then, $|F| = (\log_{\frac{1+\epsilon}{1-\epsilon}} C)^{n-1}$. For distinct $G, G' \in F$,
they must have different sparsifications. So we need $\Omega(\log |F|)$ bits for sparsification. It is easy to show that
$\log |F| = \Omega(n(\log C + \log \frac{1}{\epsilon}))$. □
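The size of the family $F$ can be evaluated numerically. The sketch below computes $\log_2 |F|$ for $|F| = (\log_r C)^{n-1}$, assuming that consecutive admissible weights differ by the ratio $r = (1+\epsilon)/(1-\epsilon)$ and that $C > r$. The function name is hypothetical.

```python
import math

def star_family_log2_size(n, eps, C):
    """log2 |F| for |F| = (log_r C)^(n - 1), with ratio r = (1 + eps) / (1 - eps).

    Assumes C > r so that there is more than one weight level per edge.
    """
    r = (1.0 + eps) / (1.0 - eps)
    levels = math.log(C) / math.log(r)  # log base r of C
    return (n - 1) * math.log2(levels)
```

For instance, with $\epsilon = 1/3$ the ratio is 2, so $C = 1024$ gives 10 levels per edge and a star on 5 nodes needs about $4 \log_2 10$ bits.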
Now we use the same proof idea for unweighted simple graphs. Since we cannot assign weights as we
want, we use $n/2$ nodes as centers instead of having one center node. In this way, we can assign the degree of
a node from 1 to $n/2$.

Theorem 6.2 For $0 < \epsilon < 1$, $\Omega(n(\log n + \log \frac{1}{\epsilon}))$ bits are required in order to sparsify every cut of a graph
within $(1 \pm \epsilon)$.

Proof: Consider bipartite graphs where each side has exactly $n/2$ nodes and each node in one side has
a degree in $1, \frac{1+\epsilon}{1-\epsilon}, (\frac{1+\epsilon}{1-\epsilon})^2, \ldots, n/2$. For each degree assignment, there exists a graph that satisfies it.
Let $F$ be a set of graphs that have different degree assignments. Then, $|F| = (\log_{\frac{1+\epsilon}{1-\epsilon}} \frac{n}{2})^{n/2}$. Distinct $G, G' \in F$
cannot have the same sparsification. So we need at least $\Omega(\log |F|) = \Omega(n(\log n + \log \frac{1}{\epsilon}))$ bits. □

Another way of viewing the above claim is a direct sum construction, where we need to use $\Omega(\log \frac{1}{\epsilon})$
bits to count up to a precision of $(1 + \epsilon)$.

7 Conclusion and Open Problems 

We presented a one-pass semi-streaming algorithm for the adversarially ordered data stream model which
uses $O(n(d \log n + \log m)(\log m - \log n)(1 + \epsilon)^2/\epsilon^2)$ edges to provide an $\epsilon$ error bound for cut values with
probability $1 - O(1/n^d)$. If the graph does not have parallel edges, the space requirement reduces to
$O(dn \log^2 n\,(1 + \epsilon)^2/\epsilon^2)$. We can solve the minimum cut problem or other problems related to cuts with
this sparsification. For the minimum cut problem, this provides a one-pass $((1 + \epsilon)/(1 - \epsilon))$-approximation
algorithm.

A natural open question is to determine how the space complexity of the approximation depends on $\epsilon$.
Our conjecture is that the bound of $n/\epsilon^2$ is tight up to logarithmic factors.